TeMex: The Web Template Extractor

Authors

  • Julián Alarte
  • David Insa
  • Josep Silva
  • Salvador Tamarit
Abstract

This paper presents TeMex, a site-level web template extractor. TeMex is fully automatic and works with online webpages without any preprocessing stage: no information about the template or the associated webpages is needed. More importantly, it does not need a predefined set of webpages to perform the analysis; TeMex only needs a URL. Contrary to previous approaches, it includes a mechanism to identify candidate webpages that share the same template. This mechanism increases both recall and precision, and it also reduces the number of webpages loaded and processed. We describe the tool and its internal architecture, and we present the results of its empirical evaluation.
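The abstract does not say how TeMex selects its candidate webpages, so the following is only a minimal sketch of one plausible first step, assuming that pages linked from the given URL within the same domain are reasonable candidates for sharing its template. The names `LinkCollector` and `candidate_pages` are hypothetical and are not part of TeMex.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def candidate_pages(base_url, html):
    """Return same-domain links found in html, in document order, deduplicated.

    Assumption (not from the paper): same-domain hyperlinks are treated as
    candidates for sharing the template of the page at base_url.
    """
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    seen = set()
    candidates = []
    for href in collector.hrefs:
        # Resolve relative links and drop the fragment part.
        absolute = urljoin(base_url, href).split("#")[0]
        if urlparse(absolute).netloc != base_host:
            continue  # skip other domains and non-HTTP schemes such as mailto:
        if absolute == base_url or absolute in seen:
            continue  # skip self-links and duplicates
        seen.add(absolute)
        candidates.append(absolute)
    return candidates
```

A real extractor would then load a subset of these candidates and compare their structure with the original page; that comparison step is not sketched here because the abstract gives no detail about it.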

Similar resources

Bridging the Gap: from Multi Document Template Detection to Single Document Content Extraction

Template Detection algorithms use collections of web documents to determine the structure of a common underlying template. Content Extraction algorithms instead operate on a single document and use heuristics to determine the main content. In this paper we propose a way to combine the reliability and theoretic underpinning of the first world with the single document based approach of the latter...

Security Efficiency Analysis of a Biometric Fuzzy Extractor for Iris Templates

A biometric fuzzy extractor scheme for iris templates was recently presented in [3]. This fuzzy extractor binds a cryptographic key to the iris template of a user, allowing the key to be recovered by authenticating the user with a new iris template from her. In this work, an analysis of the security efficiency of this fuzzy extractor is carried out by means of a study about t...

WS-NEXT, a Web Services Network Extractor Toolkit

In this article, a Web services network extractor toolkit, WS-NEXT (WS Network EXtractor Toolkit), is presented. WS-NEXT allows extraction of interaction and dependency WS networks. Networks can be extracted from syntactic and semantic WS descriptions. Such network structures can be analyzed using complex network tools. We provide examples of networks extracted from a publicly available WS coll...

A secure authentication scheme based on fuzzy extractor

Biometrics-based authentication schemes are more secure and reliable than traditional authentication schemes, and they are the inevitable trend of future development. However, among the existing schemes, the security of the user's biometric template is usually ignored, which puts the user's information security under great threat. Recently, Yan et al. proposed a secure bio...

Enhancing the Invisible Web

In recent years, a large amount of information has been placed in databases across the globe and published through dynamically generated Web pages. The evolution of the so-called Invisible (or Hidden) Web constitutes both an opportunity and an issue for Web-based information extractors. This article describes the architecture of an Invisible-Web Extractor, whose primary goal is to enhance the v...

Publication date: 2015